Towards Large Scale Environmental Data Processing with Apache Spark
نویسندگان
چکیده
Currently available environmental datasets are either manually constructed by professionals or automatically generated from the observations provided by sensing devices. Usually, the former are modelled and recorded with traditional general-purpose relational technologies, whereas the latter require more specific scientific array formats and tools. Declarative data processing technologies are available both for relational and array data, however, the efficient declarative integrated processing of array and relational environmental data is a problem for which a satisfactory solution has still not been provided. Due to the above, an integrated data processing language called MAPAL has been proposed. This paper provides a brief description of the design decisions and challenges, related to data storage and data processing that arise during the ongoing implementation of MAPAL on top of the Apache Spark large scale data processing framework.
منابع مشابه
A comparison on scalability for batch big data processing on Apache Spark and Apache Flink
*Correspondence: [email protected] 1Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071 Granada, Spain Full list of author information is available at the end of the article Abstract The large amounts of data have created a need for new fram...
متن کاملDdup - towards a deduplication framework utilising apache spark
This paper is about a new framework called DeduPlication (DduP). DduP aims to solve large scale deduplication problems on arbitrary data tuples. DduP tries to bridge the gap between big data, high performance and duplicate detection. At the moment a first prototype exists but the overall project status is work in progress. DduP utilises the promising successor of Apache Hadoop MapReduce [Had14]...
متن کاملOn the usability of Hadoop MapReduce, Apache Spark & Apache flink for data science
Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow oriented platforms, most prominently...
متن کاملGeoSpark: A Cluster Computing Framework for Processing Spatial Data
This paper introduces GeoSpark an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: Apache Spark Layer, Spatial RDD Layer and Spatial Query Processing Layer. Apache Spark Layer provides basic Spark functionalities that include loading / storing data to disk as well as regular RDD operations. Spatial RDD Layer consists of three nove...
متن کاملBig Data Approaches for the Analysis of Large-Scale fMRI Data Using Apache Spark and GPU Processing: A Demonstration on Resting-State fMRI Data from the Human Connectome Project
Technologies for scalable analysis of very large datasets have emerged in the domain of internet computing, but are still rarely used in neuroimaging despite the existence of data and research questions in need of efficient computation tools especially in fMRI. In this work, we present software tools for the application of Apache Spark and Graphics Processing Units (GPUs) to neuroimaging datase...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016